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Abstract 


We propose a new stochastic L-BFGS algo¬ 
rithm and prove a linear convergence rate 
for strongly convex and smooth functions. 
Our algorithm draws heavily from a recent 
stochastic variant of L-BFGS proposed in 


Byrd et al. (20141 as well as a recent approach 


to variance reduction for stochastic gradi¬ 


ent descent from Johnson and Zhang (20131. 


We demonstrate experimentally that our al¬ 
gorithm performs well on large-scale convex 
and non-convex optimization problems, ex¬ 
hibiting linear convergence and rapidly solv¬ 
ing the optimization problems to high levels 
of precision. Furthermore, we show that our 
algorithm performs well for a wide-range of 
step sizes, often differing by several orders of 
magnitude. 


1 Introduction 

A trend in machine learning has been toward using 
more parameters to model larger datasets. As a con¬ 
sequence, it is important to design optimization algo¬ 
rithms for these large-scale problems. A typical op¬ 
timization problem arising in this setting is empirical 
risk minimization. That is, 

1 ^ 

i=l 


When d is small, Newton’s method is often the algo¬ 
rithm of choice due to its rapid convergence (both in 
theory and in practice). However, Newton’s method 
requires the computation and inversion of the Hessian 
matrix V^/(r/;), which may be computationally too ex¬ 
pensive in high dimensions. As a consequence, prac¬ 
titioners are often limited to using first-order meth¬ 
ods which only compute gradients of the objective, 
requiring 0{d) computation per iteration. The gra¬ 
dient method is the simplest example of a first-order 
method, but much work has been done to design quasi- 
Newton methods which incorporate information about 
the curvature of the objective without ever computing 
second derivatives. L-BFGS (Liu and Nocedal 19891, 
the limited-memory version of the classic BFGS algo¬ 
rithm, is one of the most successful algorithms in this 
space. Inexact Newton methods are another approach 
to using second order information for large-scale op¬ 
timization. They approximately invert the Hessian 
in 0{d) steps. This can be done by using a constant 
number of iterations of the conjugate gradient method 


(Dembo et al. 

1982 

Dembo and Steihaug 

1983 

No- 

cedal and Wright 

2006 

I. 


When N is large, batch algorithms such as the gradient 
method, which compute the gradient of the full objec¬ 
tive at every iteration, are slowed down by the fact 
that they have to process every data point before up¬ 
dating the model. Stochastic optimization algorithms 
get around this problem by updating the model w af¬ 
ter processing only a small subset of the data, allowing 
them to make much progress in the time that it takes 
the gradient method to make a single step. 


where w G may specify the parameters of a ma¬ 
chine learning model, and fi{w) quantifies how well 
the model w fits the zth data point. Two challenges 
arise when attempting to solve Equation First, d 
may be extremely large. Second, N may be extremely 
large. 
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For many machine learning problems, where both d 
and N are large, stochastic gradient descent (SGD) 
and its variants are the most widely used algorithms 


(Robbins and Monro 

1951 Bottou 

12010 

Bottou and 

LeGun 

2004 

I, often because they are some of the few 


algorithms that can realistically be applied in this set¬ 
ting. 


Given this context, much research in optimization has 
been directed toward designing better stochastic first- 


order algorithms. For a partial list, see (Kingma and 
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astic variants of L-BFGS (Mokhtari and Ribeiro 

2014a| Wang et al. 2014 Byrd et al. 

20141 Bordes 

et al. 

2009 Schraudolph et al. 20071 

Sohl-Dickstein 

et al. 

2014). 


Unlike gradient descent, L-BFGS does not immedi¬ 
ately lend itself to a stochastic version. The updates 
in the stochastic gradient method average together to 
produce a downhill direction in expectation. However, 


as pointed out in Byrd et al. (20141, the updates used 


in L-BFGS to construct the inverse Hessian approx¬ 
imation overwrite one another instead of averaging. 
Our algorithm addresses this problem in the same ways 
as Byrd et al. (2014), by computing Hessian vector 


products formed from larger minibatches. 


Though stochastic methods often make rapid progress 
early on, the variance of the estimates of the gradi¬ 
ent slow their convergence near the optimum. To il¬ 
lustrate this phenomenon, even if SGD is initialized 
at the optimum, it will immediately move to a point 
with a worse objective value. For this reason, con¬ 
vergence guarantees typically require diminishing step 
sizes. One promising line of work involves speeding up 
the convergence of stochastic first-order methods by 


reducing the variance of the gradient estimates ( 

John- 

son and Zhangl 20131 IRoux et al. 

2012 

IDefazio et al. 

2014 Shalev-Shwartz and Zhang 

2013 

1 . 


We introduce a stochastic variant of L-BFGS that in¬ 
corporates the idea of variance reduction and has two 
desirable features. First, it obtains a guaranteed lin¬ 
ear rate of convergence in the strongly-convex case. In 
particular, it does not require a diminishing step size in 
order to guarantee convergence (as partially evidenced 
by the fact that if our algorithm is initialized at the 
optimum it will stay there). Second, it performs very 
well on large-scale optimization problems, exhibiting 
a qualitatively linear rate of convergence in practice. 


2 The Algorithm 

We consider the problem of minimizing the function 

1 ^ 

= ( 2 ) 

over w S For a subset S C {!,..., N}, we define 
the subsampled function fs by 

fs{w) = (3) 


Our updates will use stochastic estimates of the gra¬ 
dient V/s as well as stochastic ap proximation s to th e 
inverse Hessian V^/ 7 -. Following Byrd et al. (2014), 


we use distinct subsets 5,T C {l,...,iV} in order to 
decouple the estimation of the gradient from the esti¬ 
mation of the Hessian. We let b = |iS| and bn = |T|. 

Following Johnson and Zhang ( 2013| ), we occasionally 
compute full gradients, which we use to reduce the 
variance of our stochastic gradient estimates. 


The update rule for our algorithm will take the form 


Wk+i =Wk- ’nkHkVk- 

In the gradient method, is the identity ma¬ 
trix. In Newton’s method, it is the inverse Hes¬ 
sian f{wk))~^■ In our algorithm, as in L- 

BFGS, Hk will be an approximation to the inverse 
Hessian. Instead of the usual stochastic estimate of 
the gradient, Vk will be a stochastic estimate of the 
gradient with reduced variance. 

Code for our algorithm is given in Algorithm Our 
algorithm is specified by several parameters. It re¬ 
quires a step size ry, a memory size M, and positive 
integers m and L. Every m iterations, the algorithm 
performs a full gradient computation, which it uses 
to reduce the variance of the stochastic gradient esti¬ 
mates. Every L iterations, the algorithm updates the 
inverse Hessian approximation. The vector s^. records 
the average direction in which the algorithm has made 
progress over the past 2L iterations. The vector yr 
is obtained by multiplying Sr by a stochastic estimate 
of the Hessian. Note that this differs from the usual 
L-BFGS algorithm, which produces yr by taking the 
difference between successive gradients. We find that 
this approach works better in the stochastic setting. 
The inverse Hessian approximation Hr is defined from 
the pairs {sj,yj) for r — M-l -1 < j < r using the 
standard L-BFGS update rule, which is described in 
Section im The user must also choose batch sizes b 
and bn from which to construct the stochastic gradient 
and stochastic Hessian estimates. 

In Algorithm and below, we use I to refer to the 
identity matrix. We use J^k,t to denote the sigma alge¬ 
bra generated by the random variables introduced up 
to the time when the iteration counters k and t have 
the specified values. That is, 

-P _ ( {Sk',t' ■ k' <k 01 k' = k and t' <t} \ 

^ \y (j { 7 ^ : rL < mk + t} J ' 

We will use to denote the conditional expectation 
with respect to J-k,t- 

We define the inverse Hessian approximation Hr in 
Section |2.1[ Note that we do not actually construct 
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Algorithm 1 Stochastic L-BFGS 

Input: initial state Wq, parameters m, M, and L, batch sizes b and bn, and step size 77 
1 : Initialize r = 0 
2 : Initialize Hq = I 

3: for k — 0,... do 

4: Compute a full gradient /ifc = f{wk) 

5: Set Xq = Wk 

6 : for t = 0 ,..., TO — 1 do 

7: Sample a minibatch Sk,t C {1,..., N} 

8 : Compute a stochastic gradient ^fsk,ti^t) 

9: Compute a variance reduced gradient Vt = fSk,ti'^k) + Mfc 

10 : Set Xt+i = Xt — rjHrVt 

11 : if t = 0 mod L then 

12 : Increment r r + 1 

13: Set Ur = ^j=t-L 

14: Sample 7^ C {1,..., N} to define the stochastic approximation V^/r„(ur-) 

15: Compute Sr = Ur — Ur-l 

16: Compute yr = f%.i'^r)Sr 

17: Define Hr as in Section [2d] 

18: Set Wk+i = Xi for randomly chosen * G {0,..., to — 1} 


the matrix Hr because doing so would require 0{cP) 
computation. In practice, we directly compute prod¬ 


ucts of the form HrV using the two-loop recursion (No- 


cedal and Wright 2006, Algorithm 7.4). 


2.1 Construction of the Inverse Hessian 
Approximation Hr 

To define the inverse Hessian approximation Hr from 
the pairs {sj,yj), we follow the usual L-BFCS method. 
Let pj = 1/sj yj and recursively define 


for all w G and all nonempty subsets H C 
A}. Note the lower bound trivially holds in 
the regularized case. 

We will typically force strong convexity to hold by 
adding a strongly-convex regularizer to our objective 
(which can be absorbed into the ffs). These assump¬ 
tions imply that / has a unique minimizer, which we 
denote by re*. 

Lemma 3. Suppose that Assumption^^and Assump¬ 
tion^ hold. Let Br = Hf^. Then 


H^^ ^\l-pjSjy])+pjS:js], (4) 

for r — M -I- 1 < j < r. Initialize Hr^ = 
{sJVr/WyrW^)! and set Hr = Hr^\ 

Note that the update in Equation preserves positive 
definiteness (note that pj > 0 ), which implies that Hr 
and each Hr will be positive definite, as will their 
inverses. 


tT{Br) < {d-\- M)A 

yl+M 

det{Br) > 


((d-f M)A)^' 


We prove Lemma in Section [7T| 

Lemma 4. Suppose that Assumption^ and Assump- 
tion^hold. Then there exist constants 0 < 7 < F such 
that Hr satisfies 


3 Preliminaries 

Our analysis makes use of the following assumptions. 

Assumption 1. The function fi: K" —>■ M is convex 
and twice continuously differentiable for each 1 < i < 
N. 

Assumption 2. There exist positive constants A 
and A such that 

A/ ^ '^^friw) ^ A/ (5) 


< Hr < TI (6) 


for all r > 1 . 

In Section 7.2 we prove Lemma with the values 


7 = 


(d + M)A 


and F = 


((d + M)A) 


d+M-l 


Xd+M 


We will make use of Lemma a simple result for 
strongly convex functions. We include a proof for com¬ 
pleteness. 
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Lemma 5. Suppose that f is continuously differen¬ 
tiable and strongly convex with parameter A. Let w* 
he the unique minimizer of f. Then for any x G 
we have 

\\Vf{xW>2\{f{x)- f{wff). 


Conditioning on Fk,t and taking expectations in Equa¬ 
tion this becomes 

Efc,t[/(a^t-ti)] (10) 

< f{xt) - r]Vf{xt)^HrVfixt) 


Proof. By the strong convexity of /, 

fiw*) > f{x) -f Vf{x)^{w* - x) -h - xf 

> /(x) -f mm ^V/(x)^x -h ^ 

= fix) - ^||V/(x)f. 

The last equality holds by plugging in the mini¬ 
mizer u =—V/(x)/A. □ 

In Lemma we bound the variance of our variance- 
reduced gradient estimates. The proof of Lemma 

7.3[ closely follows that of [Johnson] 
Theorem 1). 

Lemma 6. Let he the unique minimizer of f. 
LetfXk = V/(wfc) and letvt = ^fsixt)-'^fsiwk)+tJ-k 
he the variance-reduced stochastic gradient. Condition¬ 
ing on Pk,t and taking an expectation with respect to S, 
we have 

]Efc,t[lkt|P] < 4A(/(xt)-/(w*)-h/('iCfe)-/(w*)). (7) 

4 Convergence Analysis 

Theorem [T] states our main result. 

Theorem 7. Suppose that Assumption^and Assump- 
tion^hold. Let w^, be the unique minimizer of f. Then 
for all k > 0, we have 

E[/('«'fc) - fiw*)] < a’^Eifiwo) - fiw*)], 

where the convergence rate a is given by 

1 /{2mr])-\-riV^ 

“ ^ qA - r?r 2 A 2 ^ ’ 

assuming that we choose ij < 7 A/( 2 r^A^) and that we 
choose m large enough to satisfy 


where we used the fact that = V/(xt). We 

then use Lemma|4]to bound the second and third terms 
on the bottom line of Equation [I^ to get 

EkAfixt+i)] < /(xt)-T? 7 l|V/(xt)f-H ^ ^ ^ Efc,t||utf. 

Now, we bound Efc^t||ut||^ using Lemma and we 
bound ||V/(xt)|p using LemmaDoing so gives 

Ekfffixt+i)] 

< f{xt) - 2r]'yX{f{xt) - fiw*)) 

-\- 2ffT'^A'^{f{xt) - fiw*) -\- fiwk) - fiw*)) 

= fixt) - - riT^AAifixt) - fiw*)) 

+ 2ffT^AAfiwk)-fiw*)). 

Taking expectations over all random variables, sum¬ 
ming over t = 0,..., TO — 1, and using a telescoping 
sum gives 

E[/(xm)] 

< E[/(xo)] -f 2mr]'^T^A^E[fiwk) - fiw*)] 

- 2?7(7A - r^r^A^) E[/{xt)] - to/(w*)^ 

= R[fiwk)] + 2mffT‘^A'^E[fiwk) - fiw*)] 

- 2mr]A\ - r 7 r^A^)E[/('u;fe+i) - fiw*)]. 
Rearranging the above gives 

0 < E[fiwk) - fixm)] + 2mr]'^r‘^A‘^E[fiwk) - /(w*)] 

- 2mr]i-fX - pV'^AAElfiwk+i) - fiw*)] 

< E[fiwk) - fiw*)] -G 2mAT^A'^E[fiwk) - fiw*)] 

- 2mpijX - 77r^A^)E[/(wfe+i) - fiw*)] 

= (1 + 2TO7?"r"A^)E[/{u;fc) - fiw*)] 

- 2mr]AX - ? 7 r^A^)E[/(uifc+i) - fiw*)]. 


given in Section 


and Zhang (2013 


7 A > ^ + 2pr^A^. (8) 

2mrj 

Proof. Using the Lipschitz continuity of V/, which fol¬ 
lows from Assumption!^ we have 

fixt-\-i) (9) 

< fixt) + V/(xt)^(xt+i - xt) -\- l||xt+i - Xtf 

= fixt) - r]Vfixt)^HrVt -\- ''^WHkVtA. 


The second inequality follows from the fact 
that fiw*) < /(xm). Using the fact that 77 < 
7 A/( 2 r^A^), it follows that 


< 


E[/(wfc+i) - fiw*)] 
1 -I- 2mrf'T^A^ 


2mr]AX — pV^AA 


Hfiwk) - fiw*)]. 


Since we chose to and rj to satisfy Equation]^ it follows 
that the rate a is less than one. This completes the 
proof. □ 
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Figure 1: The left figure plots the log of the optimization error as a function of the number of passes through 
the data for SLBFGS, SVRG, SQN, and SGD for a ridge regression problem (Millionsong). The middle figure 
does the same for a support vector machine (RCVl). The right plot shows the training loss as a function of the 
number of passes through the data for the same algorithms for a matrix completion problem (Netflix). 
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Figure 2: These figures show the log of the optimization error for SLBFGS, SVRG, SQN, and SGD on a ridge 
regression problem (millionsong) for a wide range of step sizes. 


5 Related Work 

There is a large body of work that attempts to 
improve on stochastic gradient descent by reducing 
Shalev-Shwartz and Zhang| (|2013 1 pro- 


variance. 


pose stochastic dual coordinate ascent (SDCA).|Roux 


et al. (2012[) propose the stochastic average gradient 


method (SAG). Johnson and Zhang (2013) propose the 


stochastic variance reduced gradient (SVRG). Wang 


et al. (20131 develop an approach based on the con¬ 


struction of control variates. More recently, |Frostig| 


et al. (2015) devise an online version of SVRG that 


uses streaming estimates of the gradient to perform 
variance reduction. 

^Similarly, a number of stochastic quasi-Newton meth¬ 


ods have been proposed. Bordes et al. (2009) propose a 


variant of stochastic gradient descent that makes use 
of second order information. IMokhtari and Ribeirol 


(2014a) analyze the straightforward application of L- 


BFGS in the stochastic setting and prove a 0(1/k) 
convergence rate in the strongly-convex setting. |Byrd| 
et al. (2014) propose a modified version of L-BFGS 


in the stochastic setting and prove a 0(1/k) con¬ 
vergence rate in the strongly-convex setting. |Sohl-| 


Dickstein et al. (2014) propose a stochastic qnasi- 


Newton method for minimizing snms of functions by 
maintaining a separate approximation of the inverse 
Hessian for each function in the sum. |Schraudolph| 
et al. (20071 develop a stochastic version of L-BFGS 
for the online convex optimization setting. |Wang| 
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Figure 3: These figures show the log of the optimization error for SLBFGS, SVRG, SQN, and SGD on a support 
vector machine (RCVl) for a wide range of step sizes. 


et al. (20141 prove the convergence of various stochas¬ 


tic quasi-Newton methods in the nonconvex setting. 
Our work differs from the preceding in that we guar¬ 
antee a linear rate of convergence. 


Lucchi et al. (2015) independently propose a variance- 


reduction procedure to speed up stochastic quasi- 
Newton methods and to achieve a linear rate of conver¬ 
gence. Their approach to updating the inverse-Hessian 
approximation is similar to that of L-BFGS, whereas 
our method leverages Hessian-vector products to sta¬ 
bilize the approximation. 


6 Experimental Results 


To probe our theoretical results, we compare Algo¬ 
rithm]^ (SLBFGS) to the stochastic variance-reduced 
gradient method (SVRG) ( [Johnson and Zhang] 2013|), 


the stochastic quasi-Newton method (SQN) (Byrd 


et al. 2014), and stochastic gradient descent (SGD). 


We evaluate these algorithms on several popular ma¬ 
chine learning models, including ridge regression, sup¬ 
port vector machines, and matrix completion. Our 
experiments show the effeciveness of the algorithm 
on real-world problems that are not neccessarily 
(strongly) convex. 


Because SLBFGS and SVRG require computations of 
the full gradient, each epoch requires an additional 
pass through the data. Additionally, SLBFGS and 


SQN require Hessian-vector-product calculations, each 
of which is about as expensive as a gradient calcu¬ 
lation Pearlmutter (1994). The number of Hessian- 
vector-product computations per epoch introduced by 
this is {b}{N)/{bL), which in our experiments is ei¬ 
ther N or 2N. To incorporate these additional costs, 
our plots show error with respect to the number of 
passes through the data (that is, the number of gra¬ 
dient or Hessian-vector-product computations divided 
by TV). For this reason, the first iterations of SLBFGS, 
SVRG, SQN, and SGD all begin at different times, 
with SGD appearing first and SLBFGS appearing last. 


For all experiments, we set the batch size b to either 20 
or 100, we set the Hessian batch size bn to 106 or 206, 
we set the Hessian update interval L to 10, we set 
the memory size M to 10, and we set the number of 
stochastic updates m to N/b. We optimize the learn¬ 
ing rate via grid search. SLBFGS and SVRG use a 
constant step size. For SQN and SGD, we try three 
different step-size schemes: constant, l/-\/t, and 1/t, 
and we report the best one. All experiments are ini¬ 
tialized with a vector of zeros, except for the matrix 
completion problem, where in order to break symme¬ 
try, we initialize the experiments with a vector of stan¬ 
dard normal random variables scaled by 10“^. 


First, we performed ridge regression on the million- 
song dataset (Bertin-Mahieux et al., 2011) consisting 
of approximately 4.6 x 10^ data points. We set the reg- 
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ularization parameter A = 10“^. In this experiment, 
both SLBFGS and SVRG rapidly solve the problem 
to high levels of precision. Second, we trained a sup¬ 
port vector machine on RCVl (Lewis et al. 20041, 
with approximately 7.8 x 10® data points. We set the 
regularization parameter to A = 0. In this experi¬ 
ment, SGD and SQN make more progress initially as 
expected, but SLBFGS finds a better optimum. Third, 
we solve a nonconvex matrix completion problem on 
the Netflix Prize dataset, as formulated in|Recht and| 


Re (2013), with approximately 10® data points. We set 
the regularization parameter to A = 10“^. The poor 
performance of SVRG and SGD on this problem may 
be accounted for by the fact that the algorithms are 
initialized near the vector of all zeros, which is a sta¬ 
tionary point (though not the optimum). Presumably 
the use of curvature information helps SLBFGS and 
SQN escape the neighborhood of the all zeros vector 
faster than SVRG and SGD. 


Similarly, letting Zj = (V^/ 7 -and noting 
that 

hjV ^ 

sjyj zjzj 


Assumption again implies that 


A < 


11 %- 




< A. 


( 12 ) 


Note that using the Sherman-Morrison-Woodbury for¬ 
mula, we can equivalently write Equation in terms 
of the Hessian approximation = H~^ as 


Figure plots a comparison of these methods on the 
three problems. For the convex problems, we plot the 
logarithm of the optimization error with respect to a 
precomputed reference solution. For the nonconvex 
problem, we simply plot the objective value as the 
global optimum is not necessarily known. 

6.1 Robustness to Choice of Step Size 

In this section, we illustrate that SLBFGS performs 
well on convex problems for a large range of step sizes. 
The windows in which SVRG, SQN, and SGD per¬ 
form well are much narrower. In Figure we plot 
the performance of SLBFGS, SVRG, SQN, and SGD 
for ridge regression on the millionsong dataset for step 
sizes varying over a couple orders of magnitude. In 
Figure we show a similar plot for a support vector 
machine on the RGVl dataset. In both cases, SLBFGS 
performs well, solving the problem to a high degree of 
precision over a large range of step sizes, whereas the 
performance of SVRG, SQN, and SGD degrade much 
more rapidly with poor step-size choices. 

7 Proofs of Preliminaries 

7.1 Proof of Lemma [ 3 ] 


bO) 


5 O- 1 ) 


BA~^Ajs] bA~^^ 
sj bA 


vjyJ 

yj 


(13) 


We will begin by bounding the eigenvalues of Br. We 
will do this indirectly by bounding the trace and de¬ 
terminant of Br- We have 


tiiB^A = tiiBA-^A - 


tr(R 


^AisjB^ ^^) tviyjyJ) 






-k 


yj^3 


= tr(H(^-i)) - 


sJbA~^Aj 


11% I 


yJ %■ 




y sj 


< tr(H(^-i)) -k A. 


The first equality follows from the linearity of the trace 
operator. The second equality follows from the fact 
that tr(AR) = tr(i?A). The fourth relation follows 
from Equation [T^ Since 


The analysis below closely follows many other anal¬ 
yses of the inverse Hessian approximation used in L- 


BFGS ( 

Nocedal and Wright 

2006 Byrd et al. 

2014 

Mokhtari and Ribeiro 

2014a|bl, and we include it for 


completeness. 


Note that sjyj = Sj\7^f-rAuj)sj, it follows from As¬ 
sumption that 

A||s,f <sj2/,<A||s,f- (11) 


tr(R(°)) = < dA, 

s; yr 


it follows inductively that 


tr(Rfc) < (d -k M)A. 
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Now to bound the determinant, we write 
det(i3^'^^) = det(i3^'’“^^) 

Acih 


y sjBr^ Sj 

= det(i37-^)) 

sj By Sj 

J “* 


2^7 


■SjlP s]By~^\j 


s. 

> det(B7“^7 

> det(B7“^7 


> det(s7"^7 


A„,ax(B7”'7 

A 

tr(B7”^7 

A 

(d + M)A' 


Since we defined Br = , it follows that 

1 ^ ^ ^ ^{{d + M)Ky+^-\ 

(d + M)A - "- \d+M 

7.3 Proof of Lemma [6] 

Define the function gs(w) = fsiw) — fsiw*) — 
fsiw*)^{w — Wt) to get the linearization of fs 
around the optimum re*, and note that gs is mini¬ 
mized at w*. It follows that for any w, we have 

0 = gs{wy <gs(^w- ^V55(u;)^ < gsiw)-^\\Wgs\y 

Rearranging, we have 

l!v/5M-v/5K)f 

< 2A{fs{w) - /s(w,) - '^fs{w*Viw - u>*)). 


The first equality uses det(Ai?) = det(A) det(R). The 
second equality follows from the identity 

det(/-I-uiu7 + U2 w7 ) (14) 

= (1 -I- u7^i)(l + U 2 V 2 ) - {ujV2)ivJU 2 ) 


jO-i), 


)/{sjBy 


), 


by setting ui = —Sj, vi = (R 

U 2 = {By and V 2 = Vj/iyjsj). See Dennis 

and More ( 1977[ Lemma 7.6) for a proof, or simply 


note that Equation follows from two applications 
of the identity det(A -|- uv^) = (1 -f A~^u) det(A) 
when / -|- itiu7 is invertible and by continuity when 
it isn’t. The third equality follows by multiplying the 
numerator and denominator by ||sj|p. The fourth re¬ 
lation follows from Equation and from the fact that 
sjBy~^^Sj < Ainax(R7~^7lNilP- Ths fillli relation 
uses the fact that the largest eigenvalue of a positive 
definite matrix is bounded by its trace. The sixth re¬ 
lation uses the previous bound on tr(i37~^7- Since 


det(Byy = 

it follows inductively that 

det(i?j.) > 


s^yr 


> a7 


^d+M 


((d + M)A)^' 


Averaging over all possible minibatches S C 
{!,...,iV} of cardinality b and using the fact 
that V/(rc,) = 0, we see that 

i^b) E ll^/‘S(«^)-W5(«^*)f (15) 

A A 

< 2A{f{w) - /(w*))- 


Now, let yik = '^f{wk) and Vt = fs{xt)fs{wk)+ 
/ife. Conditioning on jFk,t and taking an expectation 
with respect to 5, we find 

] < 2EkA\\^fs{xt) - V/5(u;*)f ] (16) 

+ 2Efc,t[||V/5(u;fc) - Vfsiwy - Hkf] 

< 2EkA\\^fs{xt)-Vfs{wyf] 

+ 2Ek,t[\\^fs{wk)-Vfs{wAf] 

< 4A{f{xt) - f{wA + f{wk) - fiwA). 


The first inequality uses the fact that ||a -I- 5|p < 
2||a|p-|-2||&|p. The second inequality follows by noting 
that y-k = Efc,i[V/ 5 (wfe) - and that E[||^ - 

E[C]|7] < E[||CI7] for any random variable A Tli® third 


inequality follows from Equation 15 


8 Discussion 


7.2 Proof of Lemma |4] 

Using Lemma as well as the fact that Hj. is positive 
definite, we have 

Amax(^r) < tr(R,.) < {d + M)A. 

and 

det{BA ^ 

Xm^ABrY-^ - {{d + M)Ar+^-^- 


This paper introduces a stochastic version of L-BFGS 
and proves a linear rate of convergence in the strongly 
convex case. Theorem [^captures the qualitatively lin¬ 
ear rate of convergence of SLBFGS, which is reflected 
in our experimental results. We expect SLBFGS 
to outperform other stochastic first-order methods in 
poorly conditioned settings where curvature informa¬ 
tion is valuable as well in settings where we wish to 
solve the optimization problem to high precision. 
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There are a number of interesting points to address 
in future work. The proof of Theorem and many 
similar proofs used to analyze quasi-Newton methods 
result in constants that scale poorly with the prob¬ 
lem size. At a deeper level, the point of studying 
quasi-Newton methods is to devise algorithms that lie 
somewhere along the spectrum from gradient descent 
to Newton’s method, reaping the computational ben¬ 
efits of gradient descent and the rapid convergence of 
Newton’s method. Many of the proofs in the litera¬ 
ture, including the proof of Theorem bound the ex¬ 
tent to which the quasi-Newton method deviates from 
gradient descent by bounding the extent to which the 
inverse Hessian approximation deviates from the iden¬ 
tity matrix. Those bounds are then used to show that 
the quasi-Newton method does not perform too much 
worse than gradient descent. A future avenue of re¬ 
search is to study if stochastic quasi-Newton methods 
can be designed that provably exhibit superlinear con¬ 
vergence as has been done in the non-stochastic case. 
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